Statistical Techniques for Text Classification Based on Word Recurrence Intervals
Abstract
Deciding whether two texts were written by the same author is usually difficult. Can an analysis of how the words in a text statistically cluster shed some light on authorship? In this paper we examine both English texts and the Greek source texts of the New Testament. The mathematical techniques developed by Shannon [1,2] and Markov have been used for a number of years to analyse sequences of data, whether computer code, text, or DNA. These and other probability-based techniques have been widely used in analysing DNA sequences [3] as well as both written and spoken text [4,5]. Applications of linguistic methods to DNA sequence analysis have been explored by Dong and Searls [6] and others, and this is our motivation for exploring linguistic techniques for authorship (the corresponding problem in the field of DNA research is the phylogeny of organisms based on their DNA sequences). A seminal work in the area of authorship is Mosteller [7]; a good overview of other work can be found in Oakes [8]. Durbin et al. [9] is a good reference for work on analysing DNA sequences. Ortuño et al. [10] suggest using the standard deviation of the ‘inter-word spacing’
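To make the idea of ‘inter-word spacing’ concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the spacing of a word is the number of word positions between its successive occurrences, and that dividing the standard deviation by the mean spacing is a reasonable way to compare words of different frequency; the abstract is cut off before it states the exact definition. The function names `recurrence_intervals` and `spacing_spread` are hypothetical.

```python
import re
import statistics


def recurrence_intervals(text, word):
    """Return the gaps, in word positions, between successive occurrences of `word`."""
    # \w+ matches Unicode letters, so Greek text is tokenised as well as English.
    tokens = re.findall(r"\w+", text.lower())
    positions = [i for i, tok in enumerate(tokens) if tok == word.lower()]
    return [b - a for a, b in zip(positions, positions[1:])]


def spacing_spread(text, word):
    """Standard deviation of the gaps divided by their mean.

    The division by the mean is an assumption of this sketch.
    Returns None when the word occurs too rarely to give at least two gaps.
    """
    gaps = recurrence_intervals(text, word)
    if len(gaps) < 2:
        return None
    return statistics.pstdev(gaps) / statistics.mean(gaps)


if __name__ == "__main__":
    evenly_spread = "the cat and the dog and the bird saw the fish"
    clustered = "the the the a b c d e f the"
    print(recurrence_intervals(evenly_spread, "the"), spacing_spread(evenly_spread, "the"))
    print(recurrence_intervals(clustered, "the"), spacing_spread(clustered, "the"))
```

Under these assumptions, a word that recurs at regular intervals gives a value near zero, while a word whose occurrences bunch together gives a larger value. Comparing such values across texts is one possible way to quantify how words cluster.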
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents
Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...
A New Document Embedding Method for News Classification
Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Improving the Quality of Text Classification Using a Two-Level Classifier Committee
Nowadays, the automated text classification has witnessed special importance due to the increasing availability of documents in digital form and ensuing need to organize them. Although this problem is in the Information Retrieval (IR) field, the dominant approach is based on machine learning techniques. Approaches based on classifier committees have shown a better performance than the others. I...
Publication date: 2003